Bootstrapping structured page segmentation

Authors

  • Huanfeng Ma
  • David S. Doermann
Abstract

In this paper, we present an approach to bootstrapped learning of a page segmentation model. The idea evolved from attempts to segment dictionaries, which often have a consistent page structure, and is extended to the segmentation of more general structured documents. When the structure is highly regular, the layout can be learned from examples of only a few pages. The system is first trained on a small number of samples, and a larger test set is then processed using the training result. After corrections are made to a selected subset of the test set, the corrected samples are combined with the original training samples to generate bootstrap samples. The newly created samples are used to retrain the system, refine the learned features, and resegment the test samples. This procedure is applied iteratively until the learned parameters are stable. With this approach, a large training set does not need to be provided initially, and bootstrapping refines the results step by step. We have applied this segmentation to many structured documents, such as dictionaries, phone books, and spoken language transcripts, and obtained satisfactory segmentation performance.
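The iterative procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the learner (`train`), the one-dimensional feature, the `oracle` standing in for manual correction, the choice of the first `batch` items as the "selected subset", and the stopping tolerance `tol` are all hypothetical assumptions made only to show the train, segment, correct, and retrain loop.

```python
from typing import Callable, List, Tuple

# Toy stand-in for a page element: (feature value, label). Assumed, not from the paper.
Sample = Tuple[float, int]


def train(samples: List[Sample]) -> float:
    """Hypothetical learner: a threshold halfway between the two class means."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def segment(threshold: float, features: List[float]) -> List[int]:
    """Label each feature value with the current model."""
    return [1 if x >= threshold else 0 for x in features]


def bootstrap(
    seed: List[Sample],
    test_features: List[float],
    correct: Callable[[float, int], int],
    batch: int = 2,
    tol: float = 1e-3,
    max_rounds: int = 20,
) -> float:
    """Retrain on seed plus corrected samples until the parameter is stable."""
    training = list(seed)          # small initial training set
    threshold = train(training)
    remaining = list(test_features)
    for _ in range(max_rounds):
        if not remaining:
            break
        # Segment the test set with the current model.
        guesses = segment(threshold, remaining)
        # Correct a selected subset (assumption: simply the first `batch` items).
        chosen = list(zip(remaining[:batch], guesses[:batch]))
        remaining = remaining[batch:]
        # Combine corrected samples with the existing training samples.
        training += [(x, correct(x, guess)) for x, guess in chosen]
        new_threshold = train(training)
        stable = abs(new_threshold - threshold) < tol
        threshold = new_threshold
        if stable:
            break
    return threshold


def oracle(x: float, guess: int) -> int:
    """Stand-in for a human correcting the predicted label (assumed ground truth)."""
    return 1 if x >= 5.0 else 0


seed = [(1.0, 0), (6.0, 1)]
test_features = [4.0, 9.0, 3.0, 8.0, 2.0, 10.0]
final = bootstrap(seed, test_features, oracle)
```

In this toy run the model trained on the seed alone mislabels the value 4.0, and the corrected samples move the threshold so that the retrained model labels it correctly. A real system would replace the threshold with learned layout features and the oracle with an operator's corrections.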

Similar resources

Persian Printed Document Analysis and Page Segmentation

This paper presents a hybrid method, combining low-resolution and high-resolution analysis, for Persian page segmentation. In low-resolution page segmentation, a pyramidal image structure is constructed for multiscale analysis, and the document image is segmented into a set of regions. In high-resolution page segmentation, connected component analysis segments each region into homogeneous regions and identifyi...

Adaptive Analysis and Processing of Structured Multilingual Documents

Title of dissertation: ADAPTIVE ANALYSIS AND PROCESSING OF STRUCTURED MULTILINGUAL DOCUMENTS Huanfeng Ma, Doctor of Philosophy, 2006 Dissertation directed by: Professor Rama Chellappa Dr. David S. Doermann Electrical and Computer Engineering Department Digital document processing is becoming popular for applications to office and library automation, bank and postal services, publishing houses a...

Poorly Structured Handwritten Documents Segmentation using Continuous Probabilistic Feature Grammars

This work deals with the segmentation of poorly structured handwritten documents, such as pages of handwritten notes produced with pen-based interfaces. We propose a formalism based on Probabilistic Feature Grammars that exhibits some interesting features. It allows handling ambiguities and taking into account contextual information, such as spatial relations between objects on the page.

Chinese word segmentation model using bootstrapping

We participated in the CIPS-SIGHAN2010 bake-off task on Chinese word segmentation. Unlike the previous bake-off series, the purpose of the 2010 bake-off is to test the cross-domain performance of Chinese segmentation models. This paper summarizes our approach and our bake-off results. We mainly propose to use χ statistics to increase the OOV recall and use a bootstrapping strategy to increase the overa...

Reverse Engineering Method of Web Application to UML Presentation Model Using Vision Based Segmentation Method

In recent years, many web applications have become available. Most of these applications are poorly modeled or not modeled at all. One of the main modeling techniques is presentation modeling, in which the layout of the page is shown. In this paper, we present a new reverse engineering method that takes a web page as input and returns a UML presentation model representing the page. We applied...

Publication date: 2003